graph : remove redundant GDN state transposes#20443
Conversation
|
This first, or the Vulkan implementation first? It's going to be stuck waiting for the CI forever if we have to keep adapting it before it has a chance to finish. Otherwise I'll have to merge it without waiting for the CI. |
|
@0cc4m Yes, proceed with the Vulkan implementation. Sorry for the rapid changes. |
|
This currently speeds up TG on NVGPU due to coalescing the writes (in TG we process n_token=1 and afterwards write the states back, whereas in PP we process ubatch tokens and write states back at the end). Uncoalesced reads are not so important with big caches, and I thus saw no thrashing occur (though on pre-Ada GPUs this may happen). Further PP improvements will have to revolve around batching data accesses (either process > 1 col per warp and do wider loads or unroll |
…20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ed flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7ea6ee4 to fe3ef4a
|
We may consider removing the obsolete smem code in this PR (we unified HIP/CUDA paths in #20340 via #20391, and removed the SMEM as part of this). Can also file a separate PR if preferred:
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index eb3858820..1ce6d5f31 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -135,15 +135,6 @@ __global__ void gated_delta_net_cuda(const float * q,
}
}
-static size_t calculate_smem(const int sv, int cc)
-{
- size_t smem = 0;
- if ((GGML_CUDA_CC_IS_AMD(cc) && !GGML_CUDA_CC_IS_RDNA3(cc) && !GGML_CUDA_CC_IS_RDNA4(cc)) || GGML_CUDA_CC_IS_MTHREADS(cc)) {
- smem = sv * sv * sizeof(float);
- }
- return smem;
-}
-
template <bool KDA>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -180,18 +171,14 @@ static void launch_gated_delta_net(
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
break;
case 64: {
- constexpr int sv = 64;
- size_t smem = calculate_smem(sv, cc);
- gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+ gated_delta_net_cuda<64, KDA><<<grid_dims, block_dims, 0, stream>>>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
break;
}
case 128: {
- constexpr int sv = 128;
- size_t smem = calculate_smem(sv, cc);
- gated_delta_net_cuda<sv, KDA><<<grid_dims, block_dims, smem, stream>>>(
+ gated_delta_net_cuda<128, KDA><<<grid_dims, block_dims, 0, stream>>>(
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1_magic, rq3_magic, scale);
|
|
This change made Qwen3.5 about 10% slower for tg in the Vulkan backend, with CPY getting 3x more expensive. We may be hitting some slow path; I'll look into it. |
|
I saw something like that, but thought it best to wait until we get some more shader improvements in. |
|
I don't think it was the copy shader, some driver issue was giving me bad timestamps. The timings I'm seeing now show the GDN op several times slower, but I'm not totally confident in the data. |
|
This PR resolved a tg regression on Strix Halo for Qwen MoE models using the ROCm backend. Thanks for the PR, and to the original reporter for the issue. It seems that gfx1151 is particularly sensitive to the changes being made for GDN support, perhaps due to susceptibility to register pressure. I'm wondering whether pointed testing for that architecture specifically is possible, given the apparent sensitivity. Fully acknowledging the combinatorial explosion of needing to test a wide variety of architectures, and the slippery slope, this is a pointed request specifically for GDN-related changes, given the apparent platform sensitivity and the relative popularity of the architecture. Ideally, GDN-related updates in llama.cpp should help performance across the board, but avoiding regressions would be a nice baseline :) Will defer to @IMbackK for insights on change management in relation to AMD architectures. Cheers. |




cont #20437
fix #20436
As correctly noted in #20436, there is no need to transpose the recurrent state in the GDN computation. Simplify the ggml graph for the unfused path and optimize the fused kernels using coalesced reads/writes.
TODOs: